The EWAS Catalog contains thousands of associations from hundreds of studies. By far the most common approach among these EWAS is to measure DNA methylation in blood using the Illumina Infinium HumanMethylation450 BeadChip (HM450 array). This platform assays fewer than 2% of CpG sites in the human genome, and those selected are ascertained for regions hypothesised to be relevant to gene regulation. Understanding what drives the associations found by measuring DNA methylation in this way could help prioritise CpG sites or regions of the genome to target with future technologies used in EWAS. It could also guide current EWAS study design, for example by identifying sites that could be removed before analysis.
Data were taken from The EWAS Catalog in June 2021. Results and studies that did not meet the EWAS Catalog inclusion criteria were removed, as were studies that compared DNA methylation levels between tissues, races, or ages. This left 2528 EWAS with 386,096 results at \(P < 1 \times 10^{-4}\).
To allow associations between EWAS effect estimates and CpG characteristics to be tested across traits, beta coefficients were standardised to give \(\beta_{standard}\) as follows,
\[\begin{equation} \beta_{standard} = \frac{\beta\sigma(x)} {\sigma(y)} \tag{1} \end{equation}\]
where \(\beta\) is the beta coefficient, \(\sigma\) is the standard deviation, \(x\) is the independent variable, and \(y\) is the dependent variable. As individual participant data were not available to us, the variance of DNA methylation at each site was approximated by the site-level variance supplied by the Genetics of DNA Methylation Consortium (GoDMC). The trait variance was estimated by rearranging equation (2), with the rearrangement depending on whether DNA methylation was the independent (\(x\)) or dependent (\(y\)) variable in the model.
\[\begin{equation} r^2 = \frac{\beta^2\sigma^2(x)} {\sigma^2(y)} \tag{2} \end{equation}\]
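As a minimal sketch of equations (1) and (2), the Python below standardises a single effect estimate. It assumes that \(r^2\) for the association is already available and that the GoDMC standard deviation for the CpG is passed in as `meth_sd`; the function and argument names are illustrative and are not taken from the analysis code.

```python
import math


def trait_sd(beta, r2, meth_sd, meth_is_independent):
    """Estimate the trait standard deviation by rearranging equation (2)."""
    if r2 <= 0 or beta == 0:
        raise ValueError("r2 must be positive and beta must be non-zero")
    if meth_is_independent:
        # Methylation is x, the trait is y: sigma(y) = |beta| * sigma(x) / sqrt(r2)
        return abs(beta) * meth_sd / math.sqrt(r2)
    # The trait is x, methylation is y: sigma(x) = sqrt(r2) * sigma(y) / |beta|
    return math.sqrt(r2) * meth_sd / abs(beta)


def standardise_beta(beta, r2, meth_sd, meth_is_independent):
    """Apply equation (1): beta_standard = beta * sigma(x) / sigma(y)."""
    other_sd = trait_sd(beta, r2, meth_sd, meth_is_independent)
    if meth_is_independent:
        sd_x, sd_y = meth_sd, other_sd
    else:
        sd_x, sd_y = other_sd, meth_sd
    return beta * sd_x / sd_y
```

Substituting equation (2) into the square of equation (1) gives \(\beta_{standard}^2 = r^2\), so under these assumptions the standardised estimate has magnitude \(\sqrt{r^2}\) and the sign of \(\beta\).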
GoDMC also provided the mean levels of DNA methylation at each site. Heritability of DNA methylation at each site has previously been estimated by McRae et al. 2014 and Van Dongen et al. 2016, and these values were kindly made publicly available by the authors of those studies. In this chapter, the estimates of heritability from twin data (Van Dongen et al. 2016) were used.
LOLA, along with data from Roadmap Epigenomics and ENCODE, was used to assess the enrichment of DMPs in transcription factor binding sites and chromatin states.
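To illustrate the kind of test that underlies an overlap enrichment analysis, the sketch below computes a one-sided Fisher's exact p-value from counts of CpGs. It is not LOLA itself, which works on genomic region overlaps; the function name, argument names, and the use of simple site counts are assumptions made for illustration.

```python
from math import comb


def enrichment_p(overlap, n_dmps, n_annotated, n_background):
    """One-sided Fisher's exact p-value for over-representation.

    overlap      -- DMPs that fall in the annotation (e.g. a chromatin state)
    n_dmps       -- DMPs tested, all drawn from the background
    n_annotated  -- background sites that fall in the annotation
    n_background -- total background sites
    """
    max_k = min(n_dmps, n_annotated)
    total = comb(n_background, n_dmps)
    # Sum the hypergeometric probabilities of seeing `overlap` or more hits.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_dmps - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total
```

For example, `enrichment_p(2, 3, 4, 10)` asks how likely it is that at least 2 of 3 DMPs land in an annotation covering 4 of 10 background sites.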
The results are split into the following sections:
Below are tables and figures describing the data.
| study-trait | value |
|---|---|
| Number of EWAS | 2108 |
| Unique traits | 1995 |
| Number of samples | 6353039 |
| Median sample size (range) | 4170 (93 - 17010) |
| Number of associations | 205113 |
| Unique CpGs identified | 145730 |
| Unique genes identified | 19737 |
| Sex (%) | Both (78.7), Females (18.4), Males (1.9) |
| Ancestries | Unclear (52.65), EUR (38.03), AFR (3.11), Other (2.54), ADM (2.23), EAS (0.83), SAS (0.61) |
| Age (%) | Adults (84.1), Children (5.7), Infants (5.2), Geriatrics (4.0) |
| Number of tissue types | 53 |
| Most common tissues (%) | whole blood (88.21), cord blood (4.39), placenta (1.11), cd4+ t-cells (0.95), saliva (0.83) |
Figure 1: The 10 most common trait names and EFO term labels.
Figure 2: EWAS sample characteristics
Figure 3: Number of unique traits associated with DNA methylation at each CpG. Sites associated with more than 10 unique traits are highlighted in orange and labelled.
Figure 4: Distribution of \(r^2\) values across all CpG sites in The EWAS Catalog.
Figure 5: Distribution of the sum of \(r^2\) values across each study in The EWAS Catalog.
Each study may have reported results across multiple EWAS models, adjusting for different covariates. In at least one model, 1863 studies adjusted for batch effects, 1983 studies adjusted for cell composition, and 1777 adjusted for both. Of all DMPs identified, 10% were measured by potentially faulty probes and an extra 0.46% were present on sex chromosomes (Figure 6).
Figure 6: The percentage of DMPs that may have been identified by faulty probes and the percentage of EWAS that reported identifying at least one of these probes. The left-hand bar represents all DMPs reported across all EWAS that fit into the categories shown; the right-hand bar represents the number of EWAS that include CpGs that fit into the categories shown. CpGs that are both on a sex chromosome and were identified as faulty by Zhou et al. were labelled as ‘potentially faulty’.
There were 39 studies that performed a meta-analysis of discovery and replication samples. A further 71 studies performed a separate replication analysis. Together, these provide 2770 associations within the EWAS Catalog that have been replicated at \(P < 1 \times 10^{-4}\).
Across the re-analysed GEO studies, between 0% and 96.875% of DMPs were replicated at \(P < 1 \times 10^{-4}\) (Table 2). Some of these EWAS reported very few DMPs (some only one) and, as they would have used different models, replication of a single reported result was not expected.
| Trait | N-DMPs | N-replicated | Percent-replicated |
|---|---|---|---|
| Age at menarche | 1 | 0 | 0.00 |
| Arsenic exposure | 12 | 0 | 0.00 |
| Fetal alcohol spectrum disorder | 19 | 1 | 5.26 |
| Inflammatory bowel disease | 14 | 13 | 92.86 |
| Nevus count | 1 | 0 | 0.00 |
| Psoriasis | 16 | 0 | 0.00 |
| Rheumatoid arthritis | 47,875 | 116 | 0.24 |
| Smoking | 32 | 31 | 96.88 |
| Smoking | 30 | 12 | 40.00 |
Replication could be assessed using EWAS results for the same trait across different studies. Figure 7 shows the process of preparing and selecting data for examining replication. Figure 8 shows the number of EWAS per trait (for traits with more than one EWAS). Figures 9 to 13 show the replication across the five traits selected. In the heatmaps, each row and column represents an EWAS labelled with the code “STUDY-NUM_ARRAY_TISSUE”; for example, “1_HM450_Who” denotes an EWAS from the first study in which DNA methylation was measured in whole blood using the HM450 array. The heatmaps should be read row by row. For each row, the results of the EWAS were restricted to those at \(P < 1 \times 10^{-7}\) and the CpG sites were extracted. These CpGs were then looked up in the other EWAS of the same trait, with no P value restriction beyond the catalog’s own threshold (\(P < 1 \times 10^{-4}\)).
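The row-by-row lookup can be sketched as follows. This is an illustration under stated assumptions rather than the code used to draw the heatmaps: it assumes each cell holds the proportion of a row's CpGs that are found in the column's EWAS, and the function name, input structure, and CpG identifiers in the usage example are hypothetical.

```python
def replication_matrix(ewas_results, discovery_p=1e-7, lookup_p=1e-4):
    """Build a row-by-row replication table for EWAS of one trait.

    ewas_results maps an EWAS label (e.g. "1_HM450_Who") to a dict of
    CpG identifier -> p-value, as reported in the catalog.
    """
    matrix = {}
    for row_label, row_results in ewas_results.items():
        # CpGs in the row EWAS that pass the stricter discovery threshold.
        discovered = {cpg for cpg, p in row_results.items() if p < discovery_p}
        matrix[row_label] = {}
        for col_label, col_results in ewas_results.items():
            if col_label == row_label or not discovered:
                matrix[row_label][col_label] = None  # nothing to compare
                continue
            # A CpG counts as found if the column EWAS reports it below
            # the catalog threshold; absent CpGs are treated as not found.
            found = sum(
                1 for cpg in discovered if col_results.get(cpg, 1.0) < lookup_p
            )
            matrix[row_label][col_label] = found / len(discovered)
    return matrix
```

Because each row uses its own set of discovered CpGs, the table is not symmetric, which is why the heatmaps are read one row at a time.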
Figure 7: Preparing EWAS results for replication analyses. The number of EWAS that a trait requires in order to be included in the analyses was chosen by examining Figure 8.
Figure 8: Number of EWAS per trait
Figure 9: BMI heatmap
Figure 10: Alcohol consumption heatmap
Figure 11: Birthweight heatmap
Figure 12: Smoking heatmap
Figure 13: Maternal smoking during pregnancy heatmap
Before assessing which CpG characteristics might, in part, explain some associations found in EWAS, sites were removed if they were identified by potentially faulty probes or were on either of the sex chromosomes. Further, studies that did not include batch effects and cell composition as covariates in at least one EWAS model were removed. Overall, this left 1918 EWAS and 151322 associations (at \(P < 1 \times 10^{-4}\)).
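The filtering steps above can be sketched as below. The record fields (`cpg`, `chr`, `study`), the covariate labels, and the function names are hypothetical, and the sketch assumes that a study is kept only when a single model includes both covariates.

```python
REQUIRED_COVARIATES = {"batch", "cell_composition"}
SEX_CHROMOSOMES = {"X", "Y"}


def fully_adjusted_studies(models_by_study):
    """Return studies with at least one model adjusting for both covariates.

    models_by_study maps a study identifier to a list of covariate sets,
    one set per EWAS model reported by that study.
    """
    return {
        study
        for study, models in models_by_study.items()
        if any(REQUIRED_COVARIATES <= set(covariates) for covariates in models)
    }


def filter_associations(associations, faulty_probes, kept_studies):
    """Drop faulty-probe and sex-chromosome sites, and unadjusted studies."""
    return [
        assoc
        for assoc in associations
        if assoc["cpg"] not in faulty_probes
        and assoc["chr"] not in SEX_CHROMOSOMES
        and assoc["study"] in kept_studies
    ]
```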
Figure 14: Distribution of beta values and DNA methylation levels after various transformations. The distribution of DNAm levels comes from the mean methylation levels of CpG sites across the GoDMC cohorts.
Figure 15: Associations of both h^2^ and DNAm variance with effect size. Variance and h^2^ were taken from GoDMC data.
Figure 16: Model performance. Testing the performance of the model: beta ~ variance + h^2^
Figure 17: Differences in h^2^ and variance between CpGs that have replicated and those that have not. h^2^ and variance were taken from GoDMC data. kw-p = p value from a Kruskal-Wallis test. med-diff = difference between medians.
Figure 18: Predicting whether a CpG will be a DMP using h2 and variance. ROC curves from DMP ~ h^2^ + variance, DMP ~ h^2^, DMP ~ variance
Five different groups of DMPs were defined for the enrichment analyses:
Enrichment analyses require a background set against which the DMPs of interest are tested. Figure 19 shows a representative plot of how similar the GC frequency was between the DMPs and the background.
Figure 19: GC frequency in the DMPs of interest and the background CpGs.
Figure 20: Enrichment of DMPs for 25 chromatin states. Chromatin states across the genome of 127 cell types comprising 25 distinct tissues were available from the Roadmap Epigenomics Project. Using LOLA, the enrichment of DMPs from across all data in The EWAS Catalog for chromatin states was assessed. DMPs were divided into five categories as detailed above. The x-axis shows the 25 chromatin states: TssA, Active TSS; PromU, Promoter Upstream TSS; PromD1, Promoter Downstream TSS with DNase; PromD2, Promoter Downstream TSS; Tx5’, Transcription 5’; Tx, Transcription; Tx3’, Transcription 3’; TxWk, Weak transcription; TxReg, Transcription Regulatory; TxEnh5’, Transcription 5’ Enhancer; TxEnh3’, Transcription 3’ Enhancer; TxEnhW, Transcription Weak Enhancer; EnhA1, Active Enhancer 1; EnhA2, Active Enhancer 2; EnhAF, Active Enhancer Flank; EnhW1, Weak Enhancer 1; EnhW2, Weak Enhancer 2; EnhAc, Enhancer Acetylation Only; DNase, DNase only; ZNF/Rpts, ZNF genes & repeats; Het, Heterochromatin; PromP, Poised Promoter; PromBiv, Bivalent Promoter; ReprPC, Repressed PolyComb; Quies, Quiescent/Low.
Figure 21: Enrichment of DMPs for 167 transcription factor binding sites. Using LOLA, the enrichment of DMPs from across all data in The EWAS Catalog for 167 transcription factor binding sites confirmed across 25 distinct tissues was assessed. DMPs were divided into five categories as detailed above. The x-axis shows the 25 distinct tissues. Not all transcription factor binding sites have been confirmed in every tissue. For some tissues (e.g. “Eye” and “Gingiva”) only five have been confirmed, but in blood over 131 have been confirmed.